Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees

Matteo Ceccarello

University of Padova

Joint work with Johann Gamper (U. Bolzano)

Outline

  1. Problem definition
  2. State of the art
  3. Locality Sensitive Hashing
  4. Our algorithm
  5. Experimental results

Time series and subsequences

Fix a subsequence length \(w\)

  • The Euclidean distance measures the similarity between subsequences
  • Subsequences can be seen as points in a \(w\)-dimensional space

This time series is from an experiment recording the behavior of a Circulifer tenellus insect. In particular, the subsequences below are associated with a particular feeding behavior.
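As a concrete illustration (a minimal NumPy sketch, not the implementation used in the talk), subsequences can be extracted as sliding windows and compared as points under the Euclidean distance:

```python
import numpy as np

def subsequences(ts, w):
    """View every length-w subsequence of ts as a point in R^w."""
    return np.lib.stride_tricks.sliding_window_view(ts, w)

def euclidean(x, y):
    """Euclidean distance between two subsequences (points in R^w)."""
    return float(np.linalg.norm(x - y))

ts = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0])
subs = subsequences(ts, 4)
print(subs.shape)                   # (5, 4): five points in 4-dimensional space
print(euclidean(subs[0], subs[4]))  # identical subsequences -> 0.0
```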

Top-\(k\) motifs

Definition 1 Given a time series \(T\) and a length \(w\), the top-\(k\) motifs are the \(k\) pairs of subsequences of \(T\) of length \(w\) with smallest distance, such that no two subsequences in any pair overlap with each other.
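Definition 1 can be checked by brute force on small inputs. The sketch below assumes one common reading of the overlap constraint (reported motif pairs must not overlap each other either) and is for illustration only:

```python
import numpy as np
from itertools import combinations

def topk_motifs_bruteforce(ts, w, k):
    """Brute force for Definition 1: rank all non-overlapping subsequence
    pairs by Euclidean distance and report the k closest, also requiring
    reported pairs not to overlap each other. Quadratic; illustration only."""
    subs = np.lib.stride_tricks.sliding_window_view(ts, w)
    n = len(subs)
    pairs = sorted(
        (float(np.linalg.norm(subs[i] - subs[j])), i, j)
        for i, j in combinations(range(n), 2)
        if j - i >= w  # the two subsequences of a pair must not overlap
    )
    motifs = []
    for d, i, j in pairs:
        # skip pairs that overlap an already-reported motif
        if all(min(abs(i - a), abs(i - b), abs(j - a), abs(j - b)) >= w
               for _, a, b in motifs):
            motifs.append((d, i, j))
        if len(motifs) == k:
            break
    return motifs

ts = np.random.default_rng(0).standard_normal(120)
top3 = topk_motifs_bruteforce(ts, w=10, k=3)
```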

Our contribution

  • We propose Attimo, a randomized algorithm for discovering the exact top-\(k\) motifs
  • Attimo auto-tunes its parameters to the data distribution
  • Attimo allows controlling the success probability, i.e., its recall
  • Attimo can handle time series of one billion values in just half an hour

State of the art: Matrix Profile

Algorithm

  • For each subsequence, find the nearest neighbor
  • Find the subsequence with the closest nearest neighbor
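A deliberately naive \(O(n^2)\) sketch of these two steps (real Matrix Profile implementations such as SCAMP are heavily optimized and exploit the shared structure between consecutive subsequences):

```python
import numpy as np

def naive_matrix_profile(ts, w):
    """For each subsequence, the distance to its nearest non-overlapping
    neighbor. Quadratic on purpose; real implementations exploit the
    shared structure between consecutive subsequences."""
    subs = np.lib.stride_tricks.sliding_window_view(ts, w)
    n = len(subs)
    mp = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) >= w:  # exclude overlapping (trivial) matches
                mp[i] = min(mp[i], np.linalg.norm(subs[i] - subs[j]))
    return mp

ts = np.sin(np.linspace(0, 8 * np.pi, 64))
mp = naive_matrix_profile(ts, w=8)
best = int(np.argmin(mp))  # the subsequence with the closest nearest neighbor
```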

Pros

  • Leverage the fact that consecutive subsequences share a lot of structure
  • Parallelize with many GPUs
  • Finds motifs in a time series of 1 billion values in one day, using 5 GPUs

Cons

  • It is an \(\Omega(n^2)\) algorithm for a time series of length \(n\)

Goal

  • Find the top motifs without computing all-pairs-distances

Challenges

  • Subsequences can be seen as vectors
  • These vectors are high dimensional
  • Curse of dimensionality: indices such as R-trees degrade to linear scans

Locality Sensitive Hashing

Subsequences of length \(w\) are points in \(\mathbb{R}^w\),
with Euclidean distance.

  • choose \(r \in \mathbb{R}^+\)
  • sample \(\vec{a} \sim \mathcal{N}(0,1)^w\), \(b \sim \mathcal{U}(0,r)\)

\[ h(\vec{x}) = \left\lfloor\frac{\vec{a} \cdot \vec{x} + b}{r}\right\rfloor \]

The key point is that we compute distances only between subsequences
falling into the same bucket.
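A minimal sketch of this hash family (function and variable names are illustrative, not from Attimo's code):

```python
import numpy as np

def make_hash(w, r, rng):
    """Sample one Euclidean LSH function h(x) = floor((a . x + b) / r)."""
    a = rng.standard_normal(w)  # a ~ N(0,1)^w
    b = rng.uniform(0.0, r)     # b ~ U(0, r)
    return lambda x: int(np.floor((a @ x + b) / r))

rng = np.random.default_rng(42)
h = make_hash(w=4, r=1.0, rng=rng)
x = np.array([1.0, 0.0, 1.0, 0.0])
y = x + 0.01  # a nearby point
# nearby points share a bucket with probability decreasing in their distance
bx, by = h(x), h(y)
```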

To lower the collision probability we concatenate \(\tau\) hash functions \[ \hat{h}(\vec{x}) = \langle h_1(\vec{x}), \dots, h_\tau(\vec{x}) \rangle \] This improves the precision of the index.

To increase the recall of the index, we repeat the construction \(L\) times with independent hash functions.

Computing the success probability

  • Consider a distance \(d\)
  • Assume we have performed \(L\) LSH repetitions with \(\tau\) concatenations
  • We can compute the probability of having seen, at least once, all pairs at distance \(d\)
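As a sketch, using the standard collision-probability formula for the Euclidean hash family above (Datar et al.), the probability of finding a pair at distance \(d\) within \(L\) repetitions of \(\tau\) concatenated hashes can be computed as:

```python
import math

def collision_prob(d, r):
    """Collision probability of a single Euclidean LSH function for two
    points at distance d, with bucket width r (Datar et al., 2004)."""
    c = d / r
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF
    return (1.0 - 2.0 * phi(-1.0 / c)
            - (2.0 * c / math.sqrt(2.0 * math.pi))
            * (1.0 - math.exp(-1.0 / (2.0 * c * c))))

def success_prob(d, r, tau, L):
    """Probability that a pair at distance d collides in at least one of
    L repetitions, each concatenating tau independent hash functions."""
    p = collision_prob(d, r)
    return 1.0 - (1.0 - p ** tau) ** L
```

Closer pairs are found with higher probability; increasing \(L\) raises recall, while increasing \(\tau\) sharpens precision.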

A simple algorithm

Auto tuning parameters

  • How do we set \(\tau\)?
  • The wrong value might force us to do too many repetitions
  • We adopt an approach that auto tunes the parameters based on the data at hand

  • \(L_{max}\) maximum number of repetitions,
  • \(\tau_{max}\) maximum number of concatenations (e.g. 4),
  • \(\delta\) probability parameter: succeed with probability \(\ge 1-\delta\).
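The tradeoff between \(\tau\) and \(L\) can be sketched as follows, where \(p\) is the single-hash collision probability at the distance of interest. This is not Attimo's actual tuning procedure (which adapts to the observed distance distribution); it only illustrates solving \((1 - p^\tau)^L \le \delta\) for \(L\):

```python
import math

def repetitions_needed(p, tau, delta):
    """Smallest L with 1 - (1 - p**tau)**L >= 1 - delta: enough repetitions
    to find a pair with single-hash collision probability p."""
    return math.ceil(math.log(delta) / math.log(1.0 - p ** tau))

def pick_parameters(p, delta, tau_max, L_max):
    """Prefer the largest tau whose required repetitions fit in L_max:
    more concatenations mean fewer spurious collisions per repetition."""
    for tau in range(tau_max, 0, -1):
        L = repetitions_needed(p, tau, delta)
        if L <= L_max:
            return tau, L
    return 1, L_max  # no setting reaches the target within L_max

tau, L = pick_parameters(p=0.6, delta=0.1, tau_max=4, L_max=200)  # -> (4, 17)
```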

Repetition 1

Repetition 2

Guarantees

Theorem 1 Our algorithm finds the exact top-\(k\) motifs each with probability \(\ge 1-\delta\).

  • For \(\delta=0.1\), each true motif is found with 90% probability
  • This means that we can control the recall of the algorithm
  • Tradeoff: the higher the desired recall, the slower the discovery process

Complexity

Theorem 2 The LSH index construction takes time \[ O(\tau_{max} \cdot \sqrt{L_{max}} n\log n) \]

Theorem 3 Let \(d(m_k)\) be the distance of the \(k\)-th motif, and \(i'\le \tau_{max}\), \(j' \le L_{max}\) be the parameters used by the optimal LSH algorithm. Then, the algorithm considers \[ O\left( j'\sum_{m\in T^w\times T^w} p(d(m))^{i'} + (L_{max}-j')\sum_{m\in T^w\times T^w} p(d(m))^{i'+1} \right) \] pairs in expectation.

Experimental results

Find the top-10 motifs.

dataset   \(n\) (millions)       RC
astro                  1       8.63
GAP                    2       9.17
freezer                7       7.95
ECG                    7     109.06
HumanY                26     581.03
Whales               308      21.66
Seismic             1000     274.44

The Relative Contrast (RC) measures the difficulty of a dataset: the higher, the easier. \[ RC = \frac{d_{avg}}{d_{motif}} \]
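A brute-force estimate of RC on a small input might look like this (plain Euclidean distance for simplicity; the motif distance is computed exhaustively, so this only suits short series):

```python
import numpy as np

def relative_contrast(ts, w, n_samples=1000, seed=0):
    """Estimate RC = d_avg / d_motif: the average distance over random
    non-overlapping subsequence pairs, divided by the motif distance
    (found here by brute force)."""
    rng = np.random.default_rng(seed)
    subs = np.lib.stride_tricks.sliding_window_view(ts, w)
    n = len(subs)
    dists = []
    while len(dists) < n_samples:
        i, j = rng.integers(0, n, size=2)
        if abs(i - j) >= w:  # sample non-overlapping pairs only
            dists.append(np.linalg.norm(subs[i] - subs[j]))
    d_avg = float(np.mean(dists))
    d_motif = min(
        float(np.linalg.norm(subs[i] - subs[j]))
        for i in range(n) for j in range(i + w, n)
    )
    return d_avg / d_motif

ts = np.random.default_rng(1).standard_normal(300)
rc = relative_contrast(ts, w=16)  # pure noise has relatively low contrast
```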

Scalability

Synthetic data, planted motif.

Practical takeaways

  • For shorter time series, or if the relative contrast is very small, use the Matrix Profile.

  • For time series of a few million values and above, with a not-too-small relative contrast, try Attimo.

References

  • Matteo Ceccarello, Johann Gamper: Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees. Proc. VLDB Endow. 15(13): 3841-3853 (2022)

https://github.com/Cecca/attimo



import pyattimo

# load a one-million-point prefix of the ECG dataset
ts = pyattimo.load_dataset("ecg", prefix=1000000)
# lazily iterate over motifs of length w = 1000
motifs = pyattimo.MotifsIterator(ts, w=1000)
m = next(motifs)  # the top motif
matteo.ceccarello@unipd.it

Appendix

Influence of \(L_{max}\)

Running for the top-10 motifs, with different numbers of repetitions.

Difficult datasets

Data from LIGO:

  • Length 100k, window 1k
  • Top motif distance \(\approx 40\)
  • Average distance \(\approx 44\)
  • Relative contrast \(\approx 1.1\)
  • Attimo: 6 hours
  • SCAMP: \(\approx 1\) second

freezer and the 7th motif

7M points, window 5000

  • In black are the distances of the top-10 motifs extracted from the matrix profile.
  • In red is the distance of a pair of subsequences, neither of which is the nearest neighbor of the other, and which does not overlap with higher-ranked motifs.
  • The matrix profile holds the distances and indices of the 1-nearest neighbor of each subsequence, but top-k motifs would require the k-nearest neighbors to be maintained in the matrix profile.

Top-k and k-NN

  • Solid lines are nearest neighbor distances
  • The dashed line is the distance of the top-2 pair in the definition we are using
  • The Matrix Profile contains information about 1-nearest neighbors (solid black lines)